Exploiting Similarities among Languages for Machine Translation

Authors

  • Tomas Mikolov
  • Quoc V. Le
  • Ilya Sutskever
Abstract

Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures from large monolingual data and a mapping between languages from small bilingual data. It uses distributed representations of words and learns a linear mapping between the vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish. This method makes few assumptions about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pair.
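
The core idea in the abstract, a linear mapping between two word-vector spaces evaluated with precision@5, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the vectors are synthetic random data rather than real English or Spanish embeddings, the map is fitted with ordinary least squares (the abstract does not say how the mapping is trained), candidates are ranked by cosine similarity, and the function names `fit_linear_map`, `top_k_translations`, and `precision_at_k` are hypothetical.

```python
import numpy as np


def fit_linear_map(src, tgt):
    """Least-squares matrix W such that src @ W approximates tgt."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W


def top_k_translations(vec, W, tgt_matrix, k=5):
    """Indices of the k target rows most cosine-similar to vec @ W."""
    mapped = vec @ W
    norms = np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(mapped)
    sims = (tgt_matrix @ mapped) / (norms + 1e-12)
    return np.argsort(-sims)[:k]


def precision_at_k(src, W, tgt_matrix, gold, k=5):
    """Fraction of source words whose gold index is among the top k."""
    hits = sum(
        gold[i] in top_k_translations(src[i], W, tgt_matrix, k)
        for i in range(len(src))
    )
    return hits / len(src)


# Synthetic stand-ins for monolingual embeddings (assumption: real
# embeddings would come from models trained on large monolingual corpora).
rng = np.random.default_rng(0)
n_words, d_src, d_tgt = 200, 20, 15
src_vecs = rng.normal(size=(n_words, d_src))
true_map = rng.normal(size=(d_src, d_tgt))
tgt_vecs = src_vecs @ true_map + 0.01 * rng.normal(size=(n_words, d_tgt))

# The first 150 word pairs play the role of the small bilingual dictionary.
W = fit_linear_map(src_vecs[:150], tgt_vecs[:150])

# Evaluate on the 50 held-out words against the full target vocabulary.
test_idx = np.arange(150, 200)
p5 = precision_at_k(src_vecs[test_idx], W, tgt_vecs, gold=test_idx, k=5)
print(f"precision@5 on synthetic held-out words: {p5:.2f}")
```

Because the synthetic target vectors are generated by an exactly linear map plus small noise, this toy setup is far easier than real translation; it only shows the mechanics of fitting the map and scoring the top five candidates.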

Similar resources

Exploiting Structural Similarities of Philippine Languages for A Multilingual Machine Translation System

PinoyMMT, a multilingual machine translation system, was designed for Tagalog, Cebuano and English. It exploits structural similarities of the Philippine languages Tagalog and Cebuano, and handles free word order phenomena. It has two modules: an analyzer and a synthesizer. The analyzer parses the input sentence and converts it to its feature structure representation based on the rules and lexic...

Feature-based Decipherment for Large Vocabulary Machine Translation

Orthographic similarities across languages provide a strong signal for probabilistic decipherment, especially for closely related language pairs. The existing decipherment models, however, are not well-suited for exploiting these orthographic similarities. We propose a log-linear model with latent variables that incorporates orthographic similarity features. Maximum likelihood training is comput...

Rule-based Machine Translation between Indonesian and Malaysian

We describe the development of a bidirectional rule-based machine translation system between Indonesian and Malaysian (id-ms), two closely related Austronesian languages natively spoken by approximately 35 million people. The system is based on the re-use of free and publicly available resources, such as the Apertium machine translation platform and Wikipedia articles. We also present our appro...

The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language

Machine Translation Evaluation Metrics (MTEMs) are at the core of Machine Translation (MT) engines, as these are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages are still in question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...

Rule Based Approach for Machine Translation System for Related Languages: Punjabi to Hindi

Machine translation is one of the important areas in natural language processing. Machine translation is a great challenge for closely related language pairs. A machine translation system for more or less related languages is based upon similarities such as syntax and vocabulary. Punjabi and Hindi both originated from the same parent language, so both are closely related and having lot o...

Journal title:
  • CoRR

Volume abs/1309.4168

Publication date 2013